NicheToolBox: An example of the model selection protocol for ellipsoid models

1 The example

We demonstrate the use of the model selection protocol implemented in ntbox by retrieving occurrence records for the species Leopardus wiedii, a small cat distributed in the Neotropics (Fig. 1). We then use a native function of the package to calibrate and select the best models for L. wiedii.

Figure 1. Leopardus wiedii. Image taken from Sanchez et al. (1998).

First, we set the random seed to make the example reproducible.

set.seed(12345)

2 Get the data for the example

From ntbox, we download the available occurrences for L. wiedii from the Global Biodiversity Information Facility (https://www.gbif.org) and explore the provenance and collection dates of these records (Fig. 2).

dAll_a <- ntbox::searh_gbif_data(genus = "Leopardus",
                                 species = "wiedii",
                                 occlim = 5000,
                                 leafletplot = TRUE)

We keep records from 1950 onward, since we will use the WorldClim bioclimatic layers for the modeling process.

dAll <- dAll_a %>% dplyr::filter(year>=1950)

2.1 Environmental data

Now, we use the function getData from the raster package (Hijmans 2017) to download the WorldClim layers at a resolution of 2.5 arc-minutes (~4.65 km at the Equator).

env_layers <- raster::getData('worldclim', var='bio', res=2.5)
plot(env_layers[["bio1"]])

2.2 Crop and mask environmental data

Read a shapefile of the Americas

amp <- normalizePath("../america_sin_islas")
am <- rgdal::readOGR(dsn =amp,layer="america_sn_islas")

Crop and mask the layers using the Americas shapefile

am_env <- raster::crop(env_layers,am)
am_env <- raster::mask(am_env,am)
plot(am_env[["bio1"]])

2.3 Generate environmental background data

We generate the random environmental points that will be used to compute the partial ROC test (see Owens et al. 2012; Cobos et al. 2019) on the calibrated models.

env_bg0 <- ntbox::sample_envbg(envlayers = am_env,
                               nbg = 50000)
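
Conceptually, the background sample is a random draw of non-NA cells from the environmental stack. The sketch below illustrates this idea only; `sample_bg_sketch` is a hypothetical helper, not the ntbox implementation, which also handles coordinates and parallel processing.

```r
# Sketch of environmental background sampling (hypothetical helper, not the
# ntbox source): draw nbg random non-NA cells from a raster stack and return
# their environmental values.
sample_bg_sketch <- function(envlayers, nbg) {
  vals <- raster::rasterToPoints(envlayers)      # matrix: x, y, one column per layer
  idx  <- sample(nrow(vals), size = min(nbg, nrow(vals)))
  as.data.frame(vals[idx, -(1:2), drop = FALSE]) # keep only the environmental columns
}
```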

3 Curate the data: remove duplicated and erroneous records

First, we extract the environmental information at the occurrence points

dAll_env <- raster::extract(am_env,dAll[,2:3])
dAll_genv <- data.frame(dAll[,c(2:3,ncol(dAll))],
                        dAll_env)

We remove duplicated records (records closer together than the raster resolution)

dAll_genv_clean <- ntbox::clean_dup(dAll_genv,
                                    longitude ="longitude",
                                    latitude = "latitude",
                                    threshold = res(am_env)[1])
# remove NA's
dAll_genv_clean <- na.omit(dAll_genv_clean)
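
The idea behind this deduplication step can be sketched as follows; `clean_dup_sketch` is a hypothetical helper written for illustration, not the ntbox source.

```r
# Sketch of distance-based deduplication (hypothetical helper, not the ntbox
# source): drop a record when an already-kept record lies within `threshold`
# degrees of it.
clean_dup_sketch <- function(df, longitude, latitude, threshold) {
  keep <- logical(nrow(df))
  keep[1] <- TRUE
  for (i in seq_len(nrow(df))[-1]) {
    kept <- which(keep)
    d <- sqrt((df[[longitude]][kept] - df[[longitude]][i])^2 +
              (df[[latitude]][kept]  - df[[latitude]][i])^2)
    keep[i] <- all(d >= threshold)
  }
  df[keep, , drop = FALSE]
}
```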

See the data on a leaflet map

m <- leaflet::leaflet(dAll_genv_clean)
m <- m %>% leaflet::addTiles()
m <- m %>% leaflet::addCircleMarkers(lng = ~longitude, 
                                     lat = ~latitude, 
                                     popup = ~leaflet_info, 
                                     fillOpacity = 0.25, 
                                     radius = 7)
m

Remove erroneous occurrences. Click on a pop-up to display the GBIF information (available when the data were downloaded with the searh_gbif_data function). The points outside the known distribution are the record in San Francisco (it comes from a collection; rowID=400), the record in Florida (rowID=244), and the point at latitude and longitude zero (rowID=310).

rnames <- as.numeric(rownames(dAll_genv_clean))
# Indices of the erroneous records (these can change depending on the date of the GBIF query)
to_rmIDs <- c(400,244,310)
to_rm <- which(rnames %in% to_rmIDs)
dAll_genv_clean <- dAll_genv_clean[-to_rm,]

Explore the curated data

m_cur <- leaflet::leaflet(dAll_genv_clean)
m_cur <- m_cur %>% leaflet::addTiles()
m_cur <- m_cur %>% leaflet::addCircleMarkers(lng = ~longitude, 
                                             lat = ~latitude, 
                                             popup = ~leaflet_info, 
                                             fillOpacity = 0.25, 
                                             radius = 7)
m_cur

4 Split the data into training and testing sets

Now we split the data into training and testing sets using a 70:30 proportion.

trainID <- sample(nrow(dAll_genv_clean),
                  size =ceiling(nrow(dAll_genv_clean)*0.7))

Geographic train and test data

dtrain <- dAll_genv_clean[trainID,1:2]
dtest <-  dAll_genv_clean[-trainID,1:2]

Environmental train and test

dtrain_env <- dAll_genv_clean[trainID,-c(1:3)]
dtest_env <- dAll_genv_clean[-trainID,-c(1:3)]

5 Remove strongly correlated variables

First, we estimate the correlation matrix

corsMat <- cor(dtrain_env)

Select environmental variables using a correlation threshold of 0.95

env_vars <- ntbox::correlation_finder(corsMat,
                          threshold = 0.95,
                          verbose = FALSE)$descriptors
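
A correlation filter of this kind can be sketched as a greedy pass over the variables; `cor_filter_sketch` is an illustrative helper, and correlation_finder's exact algorithm may differ.

```r
# Sketch of a greedy correlation filter (hypothetical helper): keep a variable
# only if its absolute correlation with every variable already kept stays
# below the threshold.
cor_filter_sketch <- function(corsMat, threshold = 0.95) {
  vars <- colnames(corsMat)
  kept <- vars[1]
  for (v in vars[-1]) {
    if (all(abs(corsMat[v, kept]) < threshold)) kept <- c(kept, v)
  }
  kept
}
```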

6 Ellipsoid calibration and selection

To calibrate minimum-volume ellipsoid models (Van Aelst and Rousseeuw 2009), we will fit ellipsoids covering a proportion of 0.95 of the training data, for all combinations of 3, 4, and 5 variables. The omission rate criterion is 0.05 (5%). The models will be calibrated and selected in parallel; each job will process 100 models.

nvarstest <- c(3,4,5)
t1 <- system.time({
  e_selct <- ntbox::ellipsoid_selection(env_train = dtrain_env,
                                        env_test = dtest_env,
                                        env_vars = env_vars,
                                        level = 0.95,
                                        nvarstest = nvarstest,
                                        env_bg = env_bg0,
                                        omr_criteria=0.05,
                                        parallel = TRUE,
                                        comp_each = 100,
                                        proc = TRUE)
})
## -----------------------------------------------------------------------------------------
##      **** Starting model selection process ****
## -----------------------------------------------------------------------------------------
## 
## A total number of 455 models will be created for combinations of 15 variables taken by 3 
## 
## A total number of 1365 models will be created for combinations of 15 variables taken by 4 
## 
## A total number of 3003 models will be created for combinations of 15 variables taken by 5 
## 
## -----------------------------------------------------------------------------------------
##   **A total number of 4823 models will be tested **
## 
## -----------------------------------------------------------------------------------------
## Doing calibration from model  1 to  100 in process  1 
## 
## Doing calibration from model  101 to  200 in process  2 
## 
## ...
## 
## Doing calibration from model  4801 to  4823 in process  49 
## 
## Finishing calibration of models  1 to  100 
## 
## ...
## 
## Finishing calibration of models  4801 to  4823 
## 
## Finishing...
## 
## -----------------------------------------------------------------------------------------
##   246 models passed omr_criteria for train data
##   52 models passed omr_criteria for test data
##   13 models passed omr_criteria for train and test data
# Save the results
write.csv(e_selct,"margay_model_selection_results.csv",
          row.names = FALSE)

The elapsed time in minutes:

t1/60
##        user      system     elapsed 
##  0.24813333  0.01436667 40.67981667

Now we show the ten best models. The table contains eleven fields:

  1. fitted_vars: The fitted variables
  2. nvars: Number of variables used to fit the ellipsoid model
  3. om_rate_train: Omission rate for the training data
  4. om_rate_test: Omission rate for the testing data
  5. bg_prevalence: The estimated prevalence of the species in the background data
  6. pval_bin: The p-value of the binomial test (see Peterson, Papes, and Soberon 2008) performed in environmental space
  7. pval_proc: The p-value of the partial ROC test performed in environmental space
  8. env_bg_aucratio: Environmental background AUC ratio
  9. mean_omr_train_test: Mean omission rate of the training and testing data
  10. rank_by_omr_train_test: The rank of the models given the training and testing omission rates
  11. rank_omr_aucratio: The rank of the models given the omission rates and the AUC ratio

knitr::kable(head(e_selct,10))
| fitted_vars                 | nvars | om_rate_train | om_rate_test | bg_prevalence | pval_bin | pval_proc | env_bg_aucratio | mean_omr_train_test | rank_by_omr_train_test | rank_omr_aucratio |
|-----------------------------|-------|---------------|--------------|---------------|----------|-----------|-----------------|---------------------|------------------------|-------------------|
| bio4,bio9,bio13             | 3     | 0.0379747     | 0.0427807    | 0.3023024     | 0        | 0         | 1.524967        | 0.0403777           | 13                     | 1                 |
| bio4,bio6,bio12,bio19       | 4     | 0.0126582     | 0.0481283    | 0.2999061     | 0        | 0         | 1.517946        | 0.0303933           | 1                      | 2                 |
| bio4,bio6,bio12             | 3     | 0.0379747     | 0.0481283    | 0.3099902     | 0        | 0         | 1.515569        | 0.0430515           | 23                     | 3                 |
| bio4,bio7,bio12             | 3     | 0.0379747     | 0.0481283    | 0.3250265     | 0        | 0         | 1.509778        | 0.0430515           | 24                     | 4                 |
| bio6,bio8,bio13,bio19       | 4     | 0.0253165     | 0.0481283    | 0.3154216     | 0        | 0         | 1.502200        | 0.0367224           | 4                      | 5                 |
| bio7,bio8,bio13,bio19       | 4     | 0.0253165     | 0.0481283    | 0.3247269     | 0        | 0         | 1.491501        | 0.0367224           | 5                      | 6                 |
| bio1,bio6,bio13,bio19       | 4     | 0.0253165     | 0.0481283    | 0.3316760     | 0        | 0         | 1.475350        | 0.0367224           | 6                      | 7                 |
| bio7,bio13,bio19            | 3     | 0.0379747     | 0.0427807    | 0.3759660     | 0        | 0         | 1.445673        | 0.0403777           | 14                     | 8                 |
| bio3,bio7,bio12,bio18,bio19 | 5     | 0.0253165     | 0.0481283    | 0.3643643     | 0        | 0         | 1.436196        | 0.0367224           | 7                      | 9                 |
| bio2,bio9,bio13             | 3     | 0.0379747     | 0.0481283    | 0.3907826     | 0        | 0         | 1.422951        | 0.0430515           | 25                     | 10                |
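
The console summary reports how many models passed the omission-rate criterion; an equivalent filter can be applied by hand to the results table (column names as shown above):

```r
# Keep only models whose omission rates met the 5% criterion on both the
# training and the testing data, then order them by background AUC ratio.
passed <- subset(e_selct, om_rate_train <= 0.05 & om_rate_test <= 0.05)
passed <- passed[order(-passed$env_bg_aucratio), ]
```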

7 Project the best model

We will project the best model according to the table above. For another complete example, see the help page ?ntbox::ellipsoid_selection.

# Select the variables of the best model (first row of e_selct)
bestvarcomb <- stringr::str_split(e_selct$fitted_vars,",")[[1]]

Fit the model in environmental space.

best_mod <- ntbox::cov_center(dtrain_env[,bestvarcomb],
                              mve = TRUE,
                              level = 0.99,
                              vars = 1:length(bestvarcomb))
mProj <- ntbox::ellipsoidfit(am_env[[bestvarcomb]],
                             centroid = best_mod$centroid,
                             covar = best_mod$covariance,
                             level = 0.99,size = 3)
if(length(bestvarcomb)==3)
  rgl::rglwidget()
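
The ellipsoid model scores environments by their Mahalanobis distance to the centroid: a point lies inside the fitted ellipsoid when its squared distance is at most the chi-square quantile for the chosen level. The check below reproduces this idea on the training data (a sketch of the logic behind ellipsoidfit, not its exact code):

```r
# Squared Mahalanobis distances of the training points to the model centroid.
d2 <- stats::mahalanobis(as.matrix(dtrain_env[, bestvarcomb]),
                         center = best_mod$centroid,
                         cov    = best_mod$covariance)
# Points inside the 99% minimum-volume ellipsoid.
inside <- d2 <= stats::qchisq(0.99, df = length(bestvarcomb))
mean(inside)  # proportion of training points inside the ellipsoid
```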

Project the model in geographic space

raster::plot(mProj$suitRaster)

References

Cobos, Marlon E., A. Townsend Peterson, Narayani Barve, and Luis Osorio-Olvera. 2019. “kuenm: an R package for detailed development of ecological niche models using Maxent.” PeerJ 7 (February): e6281. https://doi.org/10.7717/peerj.6281.

Hijmans, Robert J. 2017. “raster: Geographic analysis and modeling with raster data.” http://cran.r-project.org/package=raster.

Owens, H, L Campbell, L Dornak, E Saupe, N Barve, J Soberon, K Ingenloff, A Lira-Noriega, C Hensz, and C Myers. 2012. “Constraints on Interpretation of Ecological Niche Models by Limited Environmental Ranges on Calibration Areas: Software Script Appendix.”

Peterson, A. Townsend, Monica Papes, and Jorge Soberon. 2008. “Rethinking receiver operating characteristic analysis applications in ecological niche modeling.” Ecol. Modell. 213 (1): 63–72.

Sanchez, O., M. A. Pineda, H. Benitez., B. González, and H. Berlanga. 1998. Guía de identificación para las aves y mamíferos silvestres de mayor comercio en México protegidos por la C.I.T.E.S. Edited by SEMARNAP and CONABIO. México.

Van Aelst, Stefan, and Peter Rousseeuw. 2009. “Minimum volume ellipsoid.” Wiley Interdiscip. Rev. Comput. Stat. 1 (1): 71–82. https://doi.org/10.1002/wics.19.

Luis Osorio-Olvera, Andrés Lira-Noriega, Jorge Soberón, Manuel Falconi, A. Townsend Peterson, Rusby Guadalupe Díaz-Contreras, and Enrique Martinez-Meyer

2019-09-29